Add GPU implementation of NMSv2 op #28745
Conversation
…n of CombinedNonMaxSuppression op for Funcdef executions in TFTRT fallback path
@tfboyd This is the first of a series of PRs that will improve performance on object detection networks.
The test for the new op is blocked by #28744, since GPU tensors are not correctly transferred to the host without it.
… header doesn't declare it
Hi @chsigg, could you please take a look at this PR?
```cpp
void NMSKernel(const Box* d_desc_sorted_boxes, const int nboxes,
               const float thresh, const int mask_ld, int* d_delete_mask,
               bool flip_boxes = false) {
  // Storing boxes used by this CUDA block in the shared memory
```
Comments should end with a '.'.
```cpp
  // One 1D line load the boxes for x-dimension
  if (threadIdx.y == 0) {
    const Box box = d_desc_sorted_boxes[i_to_load];
    Box flipped = box;
```
I would do this on `box` directly with swap. It's unexpected to call this `flipped` when it's only flipped if `flip_boxes` is true.
```cpp
__launch_bounds__(NMS_BLOCK_DIM* NMS_BLOCK_DIM, 4) __global__
    void NMSKernel(const Box* d_desc_sorted_boxes, const int nboxes,
                   const float thresh, const int mask_ld, int* d_delete_mask,
                   bool flip_boxes = false) {
```
Prefer no default arguments.
Would it help performance to make this a template parameter?
```cpp
    }
  }
  __syncthreads();
  const int i = i_block_offset + threadIdx.x;
```
This is the same as i_to_load, no?
```cpp
  // both take about the same time
  int nto_copy = std::min(NMS_CHUNK_SIZE, N);
  cudaEvent_t copy_done;
  cudaEventCreate(&copy_done);
```
Use stream_executor::Event.
@csigg I wasn't able to find any examples of using stream_executor::Event in a similar fashion elsewhere in the kernels. Even so, I don't think this logic can be implemented with stream_executor::Event, since the framework has no mechanism equivalent to cudaEventSynchronize(). I could spin on event::poll, but that would be quite inefficient and would probably hinder the rest of the framework as well because of the locks it acquires. I would have preferred to chain these with ThenExecute(), but that would require converting all NMS ops to AsyncOps, as well as a proper thread pool on the event manager. Currently all events are executed on a single thread, and doing work there would block the event infrastructure. I could spawn the work on the CPU device thread pool from the event callback, but I am not sure that level of complexity is justified.
How would you propose I use stream_executor::Event? It is possible that I am missing something obvious.
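For context, the CUDA-runtime pattern being discussed (record an event after an async copy, then block the host on it) looks roughly like this; variable names are illustrative, not the PR's actual code:

```cuda
// Sketch of the cudaEvent-based host synchronization discussed above.
// Names are illustrative; error checking omitted for brevity.
cudaEvent_t copy_done;
cudaEventCreate(&copy_done);

// Enqueue an async device-to-host copy of one chunk on `stream`,
// then record the event behind it on the same stream.
cudaMemcpyAsync(h_chunk, d_chunk, chunk_bytes, cudaMemcpyDeviceToHost, stream);
cudaEventRecord(copy_done, stream);

// Block the host thread until the copy has finished. This blocking wait
// is the piece with no direct stream_executor::Event equivalent.
cudaEventSynchronize(copy_done);
cudaEventDestroy(copy_done);
```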
```cpp
  explicit NonMaxSuppressionV2GPUOp(OpKernelConstruction* context)
      : OpKernel(context) {}

  void Compute(OpKernelContext* context) override {
```
Add a comment explaining what the implementation does. Inline (like above the sections below) is also fine.
Sorry for the delay; there are some internal test failures and I'm still trying to fix them.
PiperOrigin-RevId: 252461000
I believe this implementation is wrong: it does not agree with the CPU version of the NMS op. When computing the area in IOU, this implementation uses tensorflow/tensorflow/core/kernels/non_max_suppression_op.cu.cc Lines 96 to 98 in e3062d1
However, the CPU version uses tensorflow/tensorflow/core/kernels/non_max_suppression_op.cc Lines 126 to 129 in e3062d1
For many inputs this may have no effect at all, but for certain inputs the two versions will produce inconsistent results. If your goal is to run object detection models, note that the "+1" is a legacy issue, and we're trying to avoid the "+1" version at Facebook. See this PR that handles "+1" in caffe2.
@samikama I tried to use your kernel from Python and I am getting a segmentation fault when running this simple script:

```python
import tensorflow as tf
tf.enable_eager_execution()
from tensorflow.python.ops import gen_image_ops

with tf.device("/device:GPU:0"):
    boxes = tf.constant([[1.0, 1.0, 1.0, 1.0]], dtype=tf.float32)
    scores = tf.constant([1.0], dtype=tf.float32)
    max_output_size = tf.constant(10, dtype=tf.int32)
    iou_threshold = tf.constant(0.7, dtype=tf.float32)
    score_threshold = tf.constant(float('-inf'), dtype=tf.float32)
    print("Start")
    x = gen_image_ops.non_max_suppression_v2(boxes, scores, max_output_size,
                                             iou_threshold, score_threshold)
    print("End")
    print(x)
```

invoked as:

```shell
docker run --runtime=nvidia -it -v $PWD:/tf -w /tf tensorflow/tensorflow:nightly-gpu-py3 \
    python pyscript.py
```

The output is a segmentation fault. Correct me if I am doing something wrong.
@ppwwyyxx Thanks for catching that. I made fixes in #30893 to support both the legacy case and an implementation identical to the CPU one.
Another point: you are passing a box with zero surface area, and it is the only box. Even though there is a single-box test in the test suite, we didn't have an invalid-box test. I will add a fix for it in an upcoming PR.
@samikama with .HostMemory() it works, thank you so much! 👍
Has this made it into any of the TensorFlow releases? As far as I know, it wasn't included in 1.13 or 1.14. How about TensorFlow 2.0?
This PR adds a GPU implementation of the NMSv2 op. It also registers a FakeGPU op for the CombinedNonMaxSuppression op to work around issues caused by the lack of a GPU implementation, until a proper GPU implementation can be built on the current GPU kernels.